jgauthier3710@floridapoly.edu
This report explores the relationship between several demographic and economic attributes and how walkable regions of Florida are on average. For this project, data was collected from the National Walkability Index dataset on the EPA’s website (Environmental Protection Agency, 2021). The dataset was filtered so that only districts in the state of Florida were included, and the districts were grouped by their CBSA (core-based statistical area). Next, summary statistics were computed for each CBSA region, so that the new dataset includes the average National Walkability Index score (AvgNWI), the average percentage of the population that is working age (AvgP_Wrk), the average district population (AvgDisPop), the total population across districts (TotCPop), and the average percentage of low-wage workers (AvgP_LowW) for all the districts in each CBSA region. Each of these attributes captures a different aspect of walkability, demographics, or economic characteristics within the selected regions. I predict that regions with a larger average working-age percentage and a smaller average low-wage percentage will have higher walkability scores on average.
My original plan was to make three plots: an interactive scatter plot, a choropleth map, and a heatmap. In the end, I made these three plots plus a coefficients plot. My first plot is an interactive scatter plot of AvgP_Wrk vs AvgP_LowW with points color-coded by AvgNWI. The second figure is a choropleth map that shows AvgNWI across the CBSA regions in Florida. Next, I made a coefficients plot from a multiple linear regression model predicting AvgNWI, because I wanted to see how the other variables relate to AvgNWI. Lastly, I made a heatmap showing the correlations between AvgNWI and the other selected attributes.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
## Warning: package 'sf' was built under R version 4.3.3
## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.3
library(terra)
## Warning: package 'terra' was built under R version 4.3.3
## terra 1.7.78
##
## Attaching package: 'terra'
##
## The following object is masked from 'package:tidyr':
##
## extract
library(htmlwidgets)
## Warning: package 'htmlwidgets' was built under R version 4.3.3
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(broom)
walkability <- read_csv("../data/EPA_SmartLocationDatabase_V3_Jan_2021_Final.csv")
## Rows: 220740 Columns: 117
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): CSA_Name, CBSA_Name
## dbl (115): OBJECTID, GEOID10, GEOID20, STATEFP, COUNTYFP, TRACTCE, BLKGRPCE,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gdb_path <- "../data/Natl_WI.gdb"
layers <- st_layers(gdb_path)
layer_name <- layers$name[1]
nwi <- st_read(gdb_path, layer = layer_name)
## Reading layer `NationalWalkabilityIndex' from data source
## `C:\Users\Jackie\Downloads\dataviz_mini-project_02\dataviz_mini-project_02\dataviz_mini-project_02\data\Natl_WI.gdb'
## using driver `OpenFileGDB'
## Simple feature collection with 220739 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -10434580 ymin: -83867.97 xmax: 3407868 ymax: 6755033
## Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic_USGS_version
flwalkability <- walkability %>%
  filter(STATEFP == '12') %>%
  group_by(CBSA_Name) %>%
  summarize(
    AvgNWI = mean(NatWalkInd, na.rm = TRUE),
    AvgP_Wrk = mean(P_WrkAge, na.rm = TRUE),
    AvgDisPop = mean(TotPop, na.rm = TRUE),
    TotCPop = sum(TotPop, na.rm = TRUE),
    AvgP_LowW = mean(R_PCTLOWWAGE, na.rm = TRUE)
  )
flwalkability
## # A tibble: 30 × 6
## CBSA_Name AvgNWI AvgP_Wrk AvgDisPop TotCPop AvgP_LowW
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arcadia, FL 6.01 0.559 1400. 36399 0.263
## 2 Cape Coral-Fort Myers, FL 9.42 0.515 1398. 718679 0.249
## 3 Clewiston, FL 6.17 0.552 1605. 40127 0.255
## 4 Crestview-Fort Walton Beach-Dest… 7.26 0.589 1646. 266595 0.258
## 5 Deltona-Daytona Beach-Ormond Bea… 10.1 0.554 1862. 634773 0.270
## 6 Gainesville, FL 9.11 0.624 1628. 320724 0.253
## 7 Homosassa Springs, FL 6.87 0.481 1626. 143087 0.267
## 8 Jacksonville, FL 10.2 0.601 2093. 1475386 0.245
## 9 Key West, FL 10.9 0.601 1004. 76325 0.235
## 10 Lake City, FL 6.49 0.570 1728. 69105 0.271
## # ℹ 20 more rows
# Join the summarized data to the map for Florida
florida_nwi <- nwi[nwi$STATEFP == '12', ]
florida_nwi <- florida_nwi[!is.na(florida_nwi$NatWalkInd), ]
users_map <- florida_nwi %>%
  left_join(flwalkability, by = "CBSA_Name")
users_map
## Simple feature collection with 11442 features and 34 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 796752.4 ymin: 259071.7 xmax: 1612207 ymax: 961154.4
## Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic_USGS_version
## First 10 features:
## GEOID10 GEOID20 STATEFP COUNTYFP TRACTCE BLKGRPCE CSA
## 1 121170221051 121170221051 12 117 022105 1 422
## 2 120710104093 120710104093 12 071 010409 3 163
## 3 120710104104 120710104104 12 071 010410 4 163
## 4 120710104103 120710104103 12 071 010410 3 163
## 5 120860003071 120860003071 12 086 000307 1 370
## 6 120950147031 120950147031 12 095 014703 1 422
## 7 120330036144 120330036144 12 033 003614 4 426
## 8 120339900000 120339900000 12 033 990000 0 426
## 9 120710104041 120710104041 12 071 010404 1 163
## 10 120710104042 120710104042 12 071 010404 2 163
## CSA_Name CBSA
## 1 Orlando-Lakeland-Deltona, FL 36740
## 2 Cape Coral-Fort Myers-Naples, FL 15980
## 3 Cape Coral-Fort Myers-Naples, FL 15980
## 4 Cape Coral-Fort Myers-Naples, FL 15980
## 5 Miami-Port St. Lucie-Fort Lauderdale, FL 33100
## 6 Orlando-Lakeland-Deltona, FL 36740
## 7 Pensacola-Ferry Pass, FL-AL 37860
## 8 Pensacola-Ferry Pass, FL-AL 37860
## 9 Cape Coral-Fort Myers-Naples, FL 15980
## 10 Cape Coral-Fort Myers-Naples, FL 15980
## CBSA_Name Ac_Total Ac_Water Ac_Land
## 1 Orlando-Kissimmee-Sanford, FL 377.49665 0.00000 377.49665
## 2 Cape Coral-Fort Myers, FL 691.21604 25.29563 665.92042
## 3 Cape Coral-Fort Myers, FL 625.08763 53.62688 571.46076
## 4 Cape Coral-Fort Myers, FL 1610.68231 191.16595 1419.51636
## 5 Miami-Fort Lauderdale-Pompano Beach, FL 99.85187 0.00000 99.85187
## 6 Orlando-Kissimmee-Sanford, FL 423.49975 18.13615 405.36360
## 7 Pensacola-Ferry Pass-Brent, FL 3826.38748 15.06668 3811.32081
## 8 Pensacola-Ferry Pass-Brent, FL 75522.48436 75522.48436 0.00000
## 9 Cape Coral-Fort Myers, FL 816.05813 70.99751 745.06061
## 10 Cape Coral-Fort Myers, FL 804.52992 56.96297 747.56695
## Ac_Unpr TotPop CountHU HH Workers D2B_E8MIXA D2A_EPHHM D3B
## 1 377.49665 1571 747 643 1012 0.6101164 0.3411523 91.587780
## 2 665.92042 1880 877 688 876 0.5984436 0.4112783 40.060523
## 3 571.46076 1875 807 807 901 0.5047176 0.5293845 69.835907
## 4 1419.51636 4061 1971 1582 1632 0.5229428 0.2998326 46.905440
## 5 99.85187 1658 384 364 619 0.2788802 0.1725376 126.068339
## 6 405.36360 2491 1155 982 1395 0.5964563 0.5454343 97.397991
## 7 3807.37991 2032 688 524 563 0.7117882 0.7462556 6.047835
## 8 0.00000 0 0 0 0 0.0000000 0.0000000 0.000000
## 9 745.06061 2809 1151 905 1224 0.6974998 0.5202777 53.853551
## 10 747.56695 2297 1073 938 1297 0.5069232 0.2965802 77.940310
## D4A D2A_Ranked D2B_Ranked D3B_Ranked D4A_Ranked NatWalkInd
## 1 -99999.00 6 12 13 1 7.666667
## 2 -99999.00 8 12 8 1 6.333333
## 3 -99999.00 11 8 11 1 7.166667
## 4 807.35 5 9 9 14 10.000000
## 5 355.40 2 3 16 17 11.833333
## 6 584.73 12 11 14 15 13.500000
## 7 -99999.00 17 16 4 1 7.166667
## 8 -99999.00 1 1 1 1 1.000000
## 9 -99999.00 11 15 10 1 8.000000
## 10 -99999.00 5 8 12 1 6.500000
## Shape_Length Shape_Area AvgNWI AvgP_Wrk AvgDisPop TotCPop AvgP_LowW
## 1 5858.584 1527709 10.190048 0.6075743 2937.963 2450261 0.2433612
## 2 6683.597 2797313 9.420233 0.5153093 1398.208 718679 0.2486237
## 3 6436.426 2529690 9.420233 0.5153093 1398.208 718679 0.2486237
## 4 10926.322 6518339 9.420233 0.5153093 1398.208 718679 0.2486237
## 5 2543.113 404094 12.707212 0.5918418 1775.130 6070944 0.2214451
## 6 5224.096 1713882 10.190048 0.6075743 2937.963 2450261 0.2433612
## 7 18751.923 15485177 9.039653 0.5996245 1791.688 481964 0.2577461
## 8 128518.435 305636842 9.039653 0.5996245 1791.688 481964 0.2577461
## 9 8180.276 3302543 9.420233 0.5153093 1398.208 718679 0.2486237
## 10 7837.987 3255888 9.420233 0.5153093 1398.208 718679 0.2486237
## Shape
## 1 MULTIPOLYGON (((1433116 731...
## 2 MULTIPOLYGON (((1398559 499...
## 3 MULTIPOLYGON (((1398823 497...
## 4 MULTIPOLYGON (((1396073 493...
## 5 MULTIPOLYGON (((1589877 448...
## 6 MULTIPOLYGON (((1419643 714...
## 7 MULTIPOLYGON (((825795.4 87...
## 8 MULTIPOLYGON (((813914 8334...
## 9 MULTIPOLYGON (((1399432 501...
## 10 MULTIPOLYGON (((1400114 499...
# Create the base ggplot
my_plot <- ggplot(
  data = flwalkability,
  mapping = aes(x = AvgP_Wrk, y = AvgP_LowW, color = AvgNWI)) +
  # The text aesthetic is only used by plotly for the hover tooltip
  geom_point(aes(text = paste(
    "CBSA Name: ", CBSA_Name, "<br>",
    "Average District Population: ", AvgDisPop
  )), size = 4) +
  scale_color_viridis_c() +
  labs(
    title = "Average Portion of the Population that is Working Age vs Low Wage",
    x = "Average Portion of the Population that is Working Age",
    y = "Average Portion of Workers that is Low Wage",
    color = "AvgNWI"
  ) +
  theme_minimal()
## Warning in geom_point(aes(text = paste("CBSA Name: ", CBSA_Name, "<br>", :
## Ignoring unknown aesthetics: text
# Convert the ggplot to an interactive plotly plot
interactive_plot <- ggplotly(my_plot, tooltip = "text")
interactive_plot
The original plan for the plot shown above was to make an interactive plot that shows the relationship between the average working-age population (AvgP_Wrk) and average low-wage population (AvgP_LowW) across different CBSA regions in Florida, with points color-coded by the average National Walkability Index (AvgNWI). I wanted to display the CBSA name and the average district population when I hover my mouse over a point. To make this plot interactive, plotly was used. This was the easiest plot to create, and I did not encounter any difficulties creating it. One additional approach I could use to explore this data further is to add a trend line to highlight patterns in the scatterplot. I could also show more information when hovering over each point.
This plot allows us to explore how characteristics like working-age population and low-wage employment correlate across different areas. Additionally, it tells the story of how these characteristics relate to National Walkability Index scores. From the graph, it appears that low-wage employment is negatively correlated with walkability, and the working-age population is positively correlated with walkability. This information can be used to inform policies related to employment and urban planning. One data visualization principle I applied to this plot is keeping the color scale for the AvgNWI variable the same as in the second plot. Additionally, the graph is kept minimal, and the points are sized so that they are easy to interact with. The hover labels on each point are easy to read and understand.
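The trend line mentioned above could be sketched roughly as follows. This is a minimal example on a small made-up data frame (`toy` and its values are hypothetical stand-ins, not the report's data), and it assumes a single straight-line fit via `geom_smooth(method = "lm")` is what is wanted.

```r
library(ggplot2)

# Hypothetical stand-in for flwalkability; values are invented for illustration
toy <- data.frame(
  AvgP_Wrk  = c(0.48, 0.52, 0.55, 0.57, 0.60, 0.62),
  AvgP_LowW = c(0.27, 0.26, 0.26, 0.25, 0.24, 0.25),
  AvgNWI    = c(6.5, 7.2, 8.0, 9.1, 10.2, 10.9)
)

trend_plot <- ggplot(toy, aes(x = AvgP_Wrk, y = AvgP_LowW)) +
  # Map color only in the point layer so the smoother fits one line overall
  geom_point(aes(color = AvgNWI), size = 4) +
  # Straight-line fit without the confidence ribbon
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "gray40") +
  scale_color_viridis_c() +
  theme_minimal()

trend_plot
```

Keeping the color mapping inside `geom_point()` rather than in the top-level `aes()` is a deliberate choice here: a continuous color mapping in the global aesthetics would also be passed to the smoother, which is not needed for a single trend line.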
ggplot(data = users_map) +
  geom_sf(aes(fill = AvgNWI), size = 0.1, color = "gray80") +
  scale_fill_viridis_c(option = "viridis", name = "Walkability Index") +
  labs(title = "Average National Walkability Index in CBSA Regions of Florida",
       caption = "Gray areas on the map represent areas without walkability data") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
    axis.text = element_blank(),
    axis.title = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.caption = element_text(hjust = 1)
  ) +
  guides(
    fill = guide_colorbar(
      title.position = "top",
      title.hjust = 0.5,
      title.vjust = 1,
      title.theme = element_text(size = 8)
    ))
The original chart planned for this figure was a choropleth map displaying the average National Walkability Index (AvgNWI) across CBSA regions in Florida. For this chart, the summarized walkability data was joined to the geographic data for the state of Florida by CBSA name. To make this graph, I tested a variety of colors to see which would suit the missing data best. I decided that light gray was the best choice because it was the least distracting. Figuring out the best color scheme was the main difficulty that I encountered when making this graph. Additionally, I initially had some difficulty merging the data. One thing I could add to the graph is a label for the most walkable area.
This map tells the story of how walkable different regions of Florida are on average. Policymakers can use this plot to identify regions in Florida with high walkability scores and then look into the urban planning policies in those areas. The graph applies several data visualization principles, including a consistent color gradient to represent the AvgNWI values. Additionally, the design is kept minimal so that the focus stays on the data in the map.
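One way to catch merge problems like the one mentioned above is to check which rows have no matching key before joining. The sketch below uses two tiny hypothetical tables (`shapes` and `summary_tbl`, with invented values) in place of the Florida map layer and the CBSA summary; only `anti_join()` and `left_join()` from dplyr are assumed.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables being merged
shapes <- tibble(
  GEOID10   = c("A1", "A2", "B1", "C1"),
  CBSA_Name = c("Gainesville, FL", "Gainesville, FL", "Key West, FL", "Arcadia, FL")
)
summary_tbl <- tibble(
  CBSA_Name = c("Gainesville, FL", "Key West, FL"),
  AvgNWI    = c(9.11, 10.9)
)

# Rows of the map layer that have no matching row in the summary table
unmatched <- anti_join(shapes, summary_tbl, by = "CBSA_Name")

# A left join keeps every shape; unmatched shapes get NA for AvgNWI
joined <- left_join(shapes, summary_tbl, by = "CBSA_Name")
```

In this toy case `unmatched` holds the single row whose name is missing from the summary, and that same row ends up with a missing `AvgNWI` in `joined`, which a map would then draw in its missing-value color.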
# Load necessary libraries
library(broom)
# Fit multiple linear regression model
model <- lm(AvgNWI ~ AvgP_Wrk + AvgDisPop + TotCPop + AvgP_LowW, data = flwalkability)
# Use broom::tidy to extract coefficients and their confidence intervals
coefficients <- tidy(model, conf.int = TRUE) %>%
  filter(term != "(Intercept)") # Remove intercept from plotting
# Plot coefficients with confidence intervals
ggplot(coefficients, aes(x = estimate, y = fct_rev(term))) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, color = "violet") +
  labs(
    title = "Coefficients of Multiple Linear Regression Model",
    x = "Coefficient",
    y = ""
  ) +
  theme_minimal()
This plot displays the coefficients with confidence intervals from a multiple linear regression model predicting AvgNWI using AvgP_Wrk, AvgDisPop, TotCPop, and AvgP_LowW. To make this plot, a multiple linear regression model called model was created with AvgNWI as the dependent variable and the other variables as predictors. Then the coefficients and their confidence intervals were extracted. The main difficulty that I encountered was deciding which model-based plot to create. One additional piece of information that I could add to the plot is the p-values. I was motivated to create this plot because a multiple linear regression model is a useful way to understand how each predictor variable relates to AvgNWI. This approach not only quantifies those relationships but can also help predict AvgNWI for new areas or scenarios.
This plot tells the story of how strongly each of the predictor variables is associated with the average National Walkability Index value (AvgNWI). By displaying the coefficients, the plot provides insights into which factors are most strongly related to the walkability of an area. The plot shows that AvgP_Wrk and AvgP_LowW have the largest coefficients, while the coefficients for the other two variables are close to zero. Because AvgDisPop and TotCPop are population counts rather than proportions, their coefficients are expressed per person, so a near-zero coefficient does not by itself mean those variables have no effect on walkability. One data visualization principle that was applied in this graph is minimalism. Additionally, I used color to draw the viewer’s attention to the zero coefficient line.
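To make coefficients on very different scales easier to compare, and to show the p-values mentioned above, one option is to standardize the predictors before fitting. The sketch below is an illustration on simulated data only: `toy`, `share`, `pop`, and `score` are hypothetical names, and the simulated relationship is invented, not estimated from the report's data.

```r
# Hypothetical data: one predictor is a proportion, the other a population count
set.seed(1)
toy <- data.frame(
  share = runif(30, 0.45, 0.65),
  pop   = runif(30, 3e4, 6e6)
)
toy$score <- 2 + 10 * toy$share + 5e-7 * toy$pop + rnorm(30, sd = 0.3)

# Raw fit: the pop coefficient is expressed per person
raw_fit <- lm(score ~ share + pop, data = toy)

# Standardized fit: each predictor rescaled to mean 0 and standard deviation 1
toy_std <- toy
toy_std$share <- as.numeric(scale(toy$share))
toy_std$pop   <- as.numeric(scale(toy$pop))
std_fit <- lm(score ~ share + pop, data = toy_std)

# Estimates, standard errors, t values, and p-values (column "Pr(>|t|)")
summary(raw_fit)$coefficients
summary(std_fit)$coefficients
```

Each standardized slope equals the raw slope multiplied by that predictor's standard deviation, so it reads as the change in the outcome per one standard deviation of the predictor. The t values and p-values of the slopes are unchanged by this rescaling; only the units of the estimates differ.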
# Calculate the correlation matrix, convert it to a long format for ggplot, and add rounded correlation values
cor_matrix <- cor(flwalkability %>% select(-CBSA_Name), use = "complete.obs")
cor_matrix_long <- as.data.frame(as.table(cor_matrix))
cor_matrix_long$nice_cor <- round(cor_matrix_long$Freq, 2)
# Plot the heatmap
heatmap_plot <- ggplot(cor_matrix_long, aes(x = Var2, y = Var1, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = nice_cor), color = "black", size = 4) +
  scale_fill_gradient2(
    low = "#F63719",
    mid = "white",
    high = "#3B9AB2",
    limits = c(-1, 1)
  ) +
  labs(x = NULL, y = NULL,
       title = "Heat Map of the Correlation Matrix") +
  coord_equal() +
  theme_minimal() +
  theme(panel.grid = element_blank())
heatmap_plot
The original figure planned here was a heatmap displaying the correlation matrix among AvgNWI, AvgP_Wrk, AvgDisPop, TotCPop, and AvgP_LowW. For this heatmap, the correlation matrix was first computed, converted into a long format, and rounded for display in ggplot. This plot visualizes the strength and direction of correlations among the variables and helps identify potential relationships and dependencies. The main difficulty that I encountered with this plot was deciding whether or not to include it. In the end, I decided to keep it because it adds to the previous graphs.
This plot tells the story of how correlated the variables analyzed in this report are with each other. The data visualization principles applied in this graph include a diverging color scheme that distinguishes positive from negative values. Additionally, I used a minimal theme and only labeled parts that needed labels.
The findings and visualizations are generally consistent with my initial prediction about the relationships between walkability and the predictor variables. Higher walkability tends to correlate with higher working-age population percentages and slightly lower proportions of low-wage workers. This suggests that areas with better walkability might attract a more economically active and potentially higher-earning population. Areas with higher walkability also tend to have larger populations.
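The long-format conversion used for the heatmap can be seen more easily on a small example. The sketch below uses a hypothetical three-column data frame (`toy`, with invented values); the final step, which drops the mirrored upper triangle, is an optional variation and not something the report's heatmap does.

```r
# Hypothetical 3-variable data frame; values are invented for illustration
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 5, 4, 5),
  c = c(5, 3, 4, 2, 1)
)

# 3 x 3 correlation matrix
cor_mat <- cor(toy)

# One row per cell of the matrix, with columns Var1, Var2, and Freq
cor_long <- as.data.frame(as.table(cor_mat))

# Optional: keep only the diagonal and lower triangle (both are column-major)
keep <- as.vector(lower.tri(cor_mat, diag = TRUE))
cor_lower <- cor_long[keep, ]
```

Here `Var1` and `Var2` hold the row and column variable names and `Freq` holds the correlation, which is why the heatmap code maps `Var2` to x, `Var1` to y, and `Freq` to fill.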
U.S. Environmental Protection Agency. (2021). National Walkability Index. Data.gov. Retrieved June 15, 2024, from https://catalog.data.gov/dataset/walkability-index1/
U.S. Environmental Protection Agency. (2021). Smart location mapping. Retrieved June 15, 2024, from https://www.epa.gov/smartgrowth/smart-location-mapping
Healy, K. (2019). Data visualization: A practical introduction. Retrieved from https://socviz.co/refineplots.html#refineplots